Is Knowledge-Free Induction of Multiword Unit Dictionary Headwords a Solved Problem?

Authors

  • Patrick Schone
  • Daniel Jurafsky
Abstract

We seek a knowledge-free method for inducing multiword units from text corpora for use as machine-readable dictionary headwords. We provide two major evaluations of nine existing collocation-finders and illustrate the continuing need for improvement. We use Latent Semantic Analysis to make modest gains in performance, but we show the significant challenges encountered in trying this approach.

Similar resources

1386 Xix

A dictionary’s macrostructure, nomenclature or lemma-sign list is, in simple terms, the inventory of all the headwords in that dictionary. Each of those lemma signs (headwords) is a canonical form, representing an entire paradigm of morphologically-related forms. The use of dictionary citation forms to group related forms is both a space-saving device that stems from the times of the paper dict...

Turkish Electronic Living Lexicon (TELL): A Lexical Database

The purpose of the TELL project is to create a database of Turkish lexical items which reflects actual speaker knowledge, rather than the normative and phonologically incomplete dictionary representations on which most of the existing phonological literature on Turkish is based. The database, accessible over the internet, should greatly enhance phonological, morphological, and lexical research ...

Compilation strategies for pedagogically effective bilingual learner’s dictionaries

Chinese, Japanese and Arabic bilingual dictionaries suffer from shortcomings rarely seen in works of other major languages. These include archaic headwords and senses, an overly prescriptive approach, learner-unfriendly sense ordering, and the omission of important multiword expressions. This paper describes how three bilingual learner’s dictionaries address these issues, focusing on compilatio...

Extracting Multiword Translations from Aligned Comparable Documents

Most previous attempts to identify translations of multiword expressions using comparable corpora relied on dictionaries of single words. The translation of a multiword was then constructed from the translations of its components. In contrast, in this work we try to determine the translation of a multiword unit by analyzing its contextual behaviour in aligned comparable documents, thereby not p...

Knowledge-Free Induction of Morphology Using Latent Semantic Analysis

Morphology induction is a subproblem of important tasks like automatic learning of machine-readable dictionaries and grammar induction. Previous morphology induction approaches have relied solely on statistics of hypothesized stems and affixes to choose which affixes to consider legitimate. Relying on stem-and-affix statistics rather than semantic knowledge leads to a number of problems, such as...

Publication date: 2001